Abstract

Transformation-based learning has been successfully
employed to solve many natural language processing
problems. It has many positive features, but one
drawback is that it does not provide estimates of
class membership probabilities.

In this paper, we present a novel method for
obtaining class membership probabilities from a
transformation-based rule list classifier. Three
experiments are presented which measure the modeling
accuracy and cross-entropy of the probabilistic
classifier on unseen data and the degree to which
the output probabilities from the classifier can be
used to estimate confidences in its classification
decisions.

The results of these experiments show that for [...]

1 Introduction

In natural language processing, a great amount of
work has gone into the development of machine
learning algorithms which extract useful linguistic
information from resources such as dictionaries,
newswire feeds, manually annotated corpora and
web pages. Most of the effective methods can
be roughly divided into rule-based and probabilistic
algorithms. In general, the rule-based methods have
the advantage of capturing the necessary information
in a small and concise set of rules. In part-of-speech
tagging, for example, rule-based and probabilistic
methods achieve comparable accuracies, but rule-based
methods capture the knowledge in a hundred or so simple
rules, while the probabilistic methods have a
very high-dimensional parameter space (millions
of parameters).

One of the main advantages of probabilistic
methods, on the other hand, is that they include a
measure of uncertainty in their output. This can
take the form of a probability distribution over
potential outputs, or it may be a ranked list of
candidate outputs. These uncertainty measures
are useful in situations where both the classification
of a sample and the system's confidence
in that classification are needed. An example of
this is a situation in an ensemble system where
ensemble members disagree and a decision must
be made about how to resolve the disagreement.
A similar situation arises in pipeline systems, such
as a system which performs parsing on the output
of a probabilistic part-of-speech tagger.

Transformation-based learning (TBL) (Brill, 1995)
is a successful rule-based machine learning
algorithm in natural language processing. It has
been applied to a wide variety of tasks, including
part of speech tagging (Roche and Schabes, 1995;
Brill, 1995), noun phrase chunking (Ramshaw and
Marcus, 1999), parsing (Brill, 1996; Vilain and
Day, 1996), spelling correction (Mangu and Brill,
1997), prepositional phrase attachment (Brill and
Resnik, 1994), dialog act tagging (Samuel et
al., 1998), segmentation and message understanding
(Day et al., 1997), often achieving state-of-the-art
performance with a small and easily understandable
list of rules.

In this paper, we describe a novel method
which enables a transformation-based classifier to
generate a probability distribution on the class
labels. Application of the method allows the
transformation rule list to retain the robustness of
the transformation-based algorithms, while benefitting
from the advantages of a probabilistic classifier.
The usefulness of the resulting probabilities
is demonstrated by comparison with another
state-of-the-art classifier, the C4.5 decision tree
(Quinlan, 1993).^1 The performance of our algorithm
compares favorably across many dimensions: it
obtains better perplexity and cross-entropy; an
active learning algorithm using our system outperforms
a similar algorithm using decision trees; and
finally, our algorithm has better rejection curves
than a similar decision tree. Section 2 presents the
transformation-based learning paradigm; Section
3 describes the algorithm for construction of the
decision tree associated with the transformation-based
list; Section 4 describes the experiments
in detail; and Section 5 concludes the paper and
outlines the future work.

^1 All the experiments are performed on text chunking.
The technique presented is general-purpose, however, and
can be applied to many tasks for which transformation-based
learning performs well, without changing the internals
of the learner.



2 Transformation rule lists 

The central idea of transformation-based learn- 
ing is to learn an ordered list of rules which 
progressively improve upon the current state of 
the training set. An initial assignment is made 
based on simple statistics, and then rules are 
greedily learned to correct the mistakes, until no 
net improvement can be made. 

These definitions and notation will be used 
throughout the paper: 

• X denotes the sample space; 

• C denotes the set of possible classifications of 
the samples; 

• The state space is defined as $S = X \times C$;

• $\pi$ will usually denote a predicate defined on $X$;

• A rule $r$ is defined as a predicate - class label
- time tuple, $(\pi, c, t)$, $c \in C$, $t \in \mathbb{N}$, where $t$ is
the learning iteration in which the rule
was learned, i.e. its position in the list;

• A rule $r = (\pi, c, t)$ applies to a state $(x, y)$ if
$\pi(x) = \text{true}$ and $c \neq y$.
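
As an illustration only, the definitions above can be written down in a few lines of Python. The names `Rule`, `applies` and `apply`, and the encoding of a state as a `(sample, label)` pair, are assumptions made for this sketch and are not notation from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

# Hypothetical encoding of a state: (sample x, current class label y).
State = Tuple[Any, str]

@dataclass(frozen=True)
class Rule:
    predicate: Callable[[Any], bool]  # pi, a predicate defined on samples
    label: str                        # c, the class label the rule assigns
    t: int                            # learning iteration / position in the list

    def applies(self, state: State) -> bool:
        """The rule applies when pi(x) is true and the label would change."""
        x, y = state
        return self.predicate(x) and self.label != y

    def apply(self, state: State) -> State:
        """Return the transformed state, or the input state if the rule does not apply."""
        x, _ = state
        return (x, self.label) if self.applies(state) else state
```

For example, a rule built as `Rule(lambda w: w.endswith("ing"), "VBG", 0)` applies to the state `("running", "NN")` but not to `("running", "VBG")`, because in the second case the label would not change.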

Using a TBL framework to solve a problem as- 
sumes the existence of: 

• An initial class assignment (mapping from X 
to S). This can be as simple as the most 
common class label in the training set, or it 
can be the output from another classifier. 

• A set of allowable templates for rules. These 
templates determine the predicates the rules 
will test, and they have the biggest influence
over the behavior of the system. 

• An objective function for learning. Unlike in 
many other learning algorithms, the objective 
function for TBL will typically optimize the 
evaluation function. An often-used method is 
the difference in performance resulting from 
applying the rule. 

At the beginning of the learning phase, the 
training set is first given an initial class assign- 
ment. The system then iteratively executes the 
following steps: 

1. Generate all productive rules. 

2. For each rule: 

(a) Apply to a copy of the most recent state 
of the training set. 

(b) Score the result using the objective func- 
tion. 

3. Select the rule with the best score. 

4. Apply the rule to the current state of the 
training set, updating it to reflect this change.

5. Stop if the score is smaller than some pre-set 
threshold T. 



6. Repeat from Step 1.

The system thus learns a list of rules in a greedy 
fashion, according to the objective function. When 
no rule that improves the current state of the 
training set beyond the pre-set threshold can 
be found, the training phase ends. During the 
evaluation phase, the evaluation set is initialized 
with the same initial class assignment. Each rule 
is then applied, in the order it was learned, to the 
evaluation set. The final classification is the one 
attained when all rules have been applied. 
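
The greedy learning loop and the evaluation phase described above can be sketched as follows. This is a minimal illustration under several assumptions: candidate rules come from a fixed list rather than from templates, a rule is encoded as a hypothetical `(predicate, label)` pair, the objective function is the net change in the number of correct labels, and the threshold is checked before the selected rule is applied.

```python
from typing import Callable, List, Tuple

# Hypothetical encoding: a rule is (predicate on the sample, label to assign).
Rule = Tuple[Callable[[int], bool], str]

def apply_rule(rule: Rule, samples: List[int], labels: List[str]) -> List[str]:
    """Relabel every sample whose predicate fires; leave the rest unchanged."""
    pred, new_label = rule
    return [new_label if pred(x) else y for x, y in zip(samples, labels)]

def n_correct(labels: List[str], gold: List[str]) -> int:
    return sum(a == b for a, b in zip(labels, gold))

def learn_tbl(samples, gold, initial, candidates, threshold=1):
    """Greedily learn an ordered rule list; threshold must be positive to guarantee termination."""
    labels = list(initial)
    learned = []
    while candidates:
        base = n_correct(labels, gold)
        # Score each candidate on a copy of the current state of the training set.
        gains = [n_correct(apply_rule(r, samples, labels), gold) - base
                 for r in candidates]
        best = max(range(len(candidates)), key=lambda i: gains[i])
        if gains[best] < threshold:      # no rule improves enough: stop
            break
        labels = apply_rule(candidates[best], samples, labels)
        learned.append(candidates[best])
    return learned

def classify(samples, initial, learned):
    """Evaluation phase: apply the rules in the order they were learned."""
    labels = list(initial)
    for rule in learned:
        labels = apply_rule(rule, samples, labels)
    return labels
```

Note that a later rule may undo part of an earlier one, which is why the order of the list matters during classification.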

3 Probability estimation with 
transformation rule lists 

Rule lists are infamous for making hard decisions, 
decisions which adhere entirely to one possibility, 
excluding all others. These hard decisions are 
often accurate and outperform other types of 
classifiers in terms of exact-match accuracy, but 
because they do not have an associated proba- 
bility, they give no hint as to when they might 
fail. In contrast, probabilistic systems make soft 
decisions by assigning a probability distribution 
over all possible classes. 

There are many applications where soft deci- 
sions prove useful. In situations such as active 
learning, where a small number of samples are 
selected for annotation, the probabilities can be 
used to determine which examples the classifier 
was most unsure of, and hence should provide the 
most extra information. A probabilistic system 
can also act as a filter for a more expensive 
system or a human expert when it is permitted 
to reject samples. Soft decision-making is also
useful when the system is one of the components
in a larger decision-making process, as is the case
in speech recognition systems (Bahl et al., 1989),
or in an ensemble system like AdaBoost (Freund
and Schapire, 1997). There are many other
applications in which a probabilistic classifier is
necessary, and a non-probabilistic classifier cannot
be used instead.

3.1 Estimation via conversion to decision 
tree 

The method we propose to obtain probabilis- 
tic classifications from a transformation rule list 
involves dividing the samples into equivalence 
classes and computing distributions over each 
equivalence class. At any given point in time $i$,
each sample $x$ in the training set has an associated
state $s_i(x) = (x, y)$. Let $R(x)$ be the set of rules
$r_i$ that apply to the state $s_i(x)$:

$$R(x) = \{ r_i \in \mathcal{R} \mid r_i \text{ applies to } s_i(x) \}$$

An equivalence class consists of all the samples
$x$ that have the same $R(x)$. Class probability
assignments are then estimated using statistics
computed on the equivalence classes.
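
A minimal sketch of this grouping is given below. It assumes that rules are encoded as hypothetical `(predicate, label)` pairs, that the signature of a sample is the tuple of indices of the rules that applied to it as the list was run (together with its initial label), and that class distributions are plain relative frequencies of the true labels; none of these names come from the paper.

```python
from collections import Counter, defaultdict

def equivalence_classes(samples, initial, gold, rules):
    """Group samples by the rules that applied to them and estimate
    a class distribution for each group from the true labels.

    rules: list of (predicate, label) pairs in learned order.
    Returns {(initial_label, fired_rule_indices): {class: probability}}.
    """
    counts = defaultdict(Counter)
    for x, y0, truth in zip(samples, initial, gold):
        y, fired = y0, []
        for i, (pred, c) in enumerate(rules):
            if pred(x) and c != y:      # rule i applies to the state (x, y)
                fired.append(i)
                y = c
        counts[(y0, tuple(fired))][truth] += 1
    return {
        sig: {c: n / sum(cnt.values()) for c, n in cnt.items()}
        for sig, cnt in counts.items()
    }
```

Samples that no rule ever touched end up in the class with an empty rule tuple, which is typically the largest one.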



An illustration of the conversion from a rule
list to a decision tree is shown below. Table 1
shows an example transformation rule list. It is
straightforward to convert this rule list into a
decision pylon (Bahl et al., 1989), which can be used
to represent all the possible sequences of labels
assigned to a sample during the application of the
TBL algorithm. The decision pylon associated
with this particular rule list is displayed on the left
side of Figure 1. The decision tree shown on the
right side of Figure 1 is constructed such that the
samples stored in any leaf have the same class label
sequence as in the displayed decision pylon. In
the decision pylon, "no" answers go straight down;
in the decision tree, "yes" answers take the right
branch. Note that one rule in the transformation
rule list can often correspond to more than one
node in the decision tree.



Initial label = A
If Q1 and label=A then label←B
If Q2 and label=A then label←B
If Q3 and label=B then label←A

Table 1: Example of a Transformation Rule List.

Figure 1: Converting the transformation rule list
from Table 1 to a decision tree.

The conversion from a transformation rule list 
to a decision tree is presented as a recursive 
procedure. The set of samples in the training set 
is transformed to a set of states by applying the 
initial class assignments. A node n is created for 
each of the initial class label assignments c and all 
states labeled c are assigned to n. 

The following recursive procedure is invoked 
with an initial "root" node, the complete set of 
states (from the corpus) and the whole sequence 
of rules learned during training: 

Algorithm: RuleListToDecisionTree (RLTDT)

Input:

• A set $B$ of $N$ states $\{(x_1, y_1) \ldots (x_N, y_N)\}$ with
labels $y_i \in C$;

• A set $\mathcal{R}$ of $M$ rules $(r_0, r_1 \ldots r_M)$ where $r_i = (\pi_i, y_i, i)$.

Do:

1. If $\mathcal{R}$ is empty, the end of the rule list has been
reached. Create a leaf node, $n$, and estimate
the probability class distribution based on the
true classifications of the states in $B$. Return
$n$.

2. Let $r_j = (\pi_j, y_j, j)$ be the lowest-indexed rule
in $\mathcal{R}$. Remove it from $\mathcal{R}$.

3. Split the data in $B$ using the predicate $\pi_j$ and
the current hypothesis such that samples on
which $\pi_j$ returns true are on the right of the
split:

$$B_L = \{ x \in B \mid \pi_j(x) = \text{false} \}$$
$$B_R = \{ x \in B \mid \pi_j(x) = \text{true} \}$$

4. If $|B_L| > K$ and $|B_R| > K$, the split is
acceptable:

(a) Create a new internal node, $n$;

(b) Set the question: $q(n) = \pi_j$;

(c) Create the left child of $n$ using a recursive
call to RLTDT($B_L$, $\mathcal{R}$);

(d) Create the right child of $n$ using a recursive
call to RLTDT($B_R$, $\mathcal{R}$);

(e) Return node $n$.

Otherwise, no split is performed using $r_j$.
Repeat from Step 1.

The parameter $K$ is a constant that determines the
minimum weight that a leaf is permitted to have,
effectively pruning the tree during construction.
In all the experiments, $K$ was set to 5.
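
The recursive procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: a state is a hypothetical `(sample, current label, true label)` triple, a rule is a `(predicate, new label)` pair whose predicate may inspect the current label (as in Table 1), and states sent to the right branch take the rule's label. The names `Node`, `rltdt` and `predict` are invented for the sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Tuple[Any, str, str]                   # (sample, current label, true label)
Rule = Tuple[Callable[[Any, str], bool], str]  # (predicate, new label)

@dataclass
class Node:
    question: Optional[Callable[[Any, str], bool]] = None
    label: Optional[str] = None                # label assigned on the "yes" branch
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    dist: Optional[Dict[str, float]] = None    # set on leaves only

def rltdt(states: List[State], rules: List[Rule], k: int = 5) -> Node:
    """Convert the remaining rule list into a (sub)tree over the given states."""
    for j, (pred, new_label) in enumerate(rules):
        b_left = [s for s in states if not pred(s[0], s[1])]
        b_right = [(x, new_label, truth) for x, y, truth in states if pred(x, y)]
        if len(b_left) > k and len(b_right) > k:   # acceptable split
            rest = rules[j + 1:]
            return Node(question=pred, label=new_label,
                        left=rltdt(b_left, rest, k),
                        right=rltdt(b_right, rest, k))
        # otherwise skip this rule and try the next one
    # Rule list exhausted: leaf with the distribution of true labels.
    counts = Counter(truth for _, _, truth in states)
    total = sum(counts.values())
    return Node(dist={c: n / total for c, n in counts.items()} if total else {})

def predict(node: Node, x: Any, y: str) -> Dict[str, float]:
    """Follow the questions from the root; 'yes' answers take the right branch."""
    while node.dist is None:
        if node.question(x, y):
            y, node = node.label, node.right
        else:
            node = node.left
    return node.dist
```

Calling `rltdt(states, rules)` with the default `k=5` mirrors the setting reported above; with very small data sets every split is rejected and a single leaf is returned.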

3.2 Further growth of the decision tree 

When a rule list is converted into a decision tree, 
there are often leaves that are inordinately heavy 
because they contain a large number of samples. 
Examples of such leaves are those containing 
samples which were never transformed by any 
of the rules in the rule list. These populations
exist either because they could not be split up 
during the rule list learning without incurring a 
net penalty, or because any rule that acts on them 
has an objective function score of less than the 
threshold T. This is sub-optimal for estimation 
because when a large portion of the corpus falls 
into the same equivalence class, the distribution 
assigned to it reflects only the mean of those 
samples. The undesirable consequence is that all 
of those samples are given the same probability 
distribution. 

To ameliorate this problem, those samples are
partitioned into smaller equivalence classes by
further growing the decision tree. Since a decision
tree does not place all the samples with the same
current label into a single equivalence class, it does
not get stuck in the same situation as a rule list,
in which no change in the current state of the
corpus can be made without incurring a net loss
in performance.



Continuing to grow the decision tree that was
converted from a rule list can be viewed from
another angle. A highly accurate prefix tree
for the final decision tree is created by tying
questions together during the first phase of the
growth process (TBL). Unlike traditional decision
trees which select splitting questions for a node
by looking only at the samples contained in the
local node, this decision tree selects questions by
looking at samples contained in all nodes on the
frontier whose paths have a suffix in common. An
illustration of this phenomenon can be seen in
Figure 1, where the choice to split on Question
3 was made from samples which tested false
on the predicate of Question 1, together with
samples which tested false on the predicate of
Question 2. The result of this is that questions
are chosen based on a much larger population than
in standard decision tree growth, and therefore
have a much greater chance of being useful and
generalizable. This alleviates the problem of
over-partitioning of data, which is a widely recognized
concern during decision tree growth.

The decision tree obtained from this conversion
can be grown further. When the rule list $\mathcal{R}$ is
exhausted at Step 1, instead of creating a leaf
node, continue splitting the samples contained in
the node with a decision tree induction algorithm.
The splitting criterion used in the experiments is
the information gain measure.
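
For reference, the information gain of a binary split can be computed as in the sketch below. It uses the standard definition (parent entropy minus the size-weighted entropy of the two children); the function names and the boolean-mask interface are assumptions of the sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, mask):
    """Gain from splitting `labels` by a boolean mask (True goes right)."""
    n = len(labels)
    if n == 0:
        return 0.0
    left = [y for y, m in zip(labels, mask) if not m]
    right = [y for y, m in zip(labels, mask) if m]
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A question that separates the classes perfectly recovers the full parent entropy, while a question that leaves the class proportions unchanged in both children has zero gain.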

4 Experiments 

Three experiments that demonstrate the effec- 
tiveness and appropriateness of our probability 
estimates are presented in this section. The 
experiments are performed on text chunking, a 
subproblem of syntactic parsing. Unlike full pars- 
ing, the sentences are divided into non-overlapping 
phrases, where each word belongs to the lowest 
parse constituent that dominates it. 

The data used in all of these experiments is
the CoNLL-2000 phrase chunking corpus (CoNLL,
2000). The corpus consists of sections 15-18 and
section 20 of the Penn Treebank (Marcus et al.,
1993), and is pre-divided into an 8936-sentence
(211727 tokens) training set and a 2012-sentence
(47377 tokens) test set. The chunk tags are
derived from the parse tree constituents, and the
part-of-speech tags were generated by the Brill
tagger (Brill, 1995).



As was noted by Ramshaw & Marcus (1999),
text chunking can be mapped to a tagging task,
where each word is tagged with a chunk tag
representing the phrase that it belongs to. An
example sentence from the corpus is shown in
Table 2. As a contrasting system, our results
are compared with those produced by a C4.5
decision tree system (henceforth C4.5). The
reason for using C4.5 is twofold: firstly, it is a
widely used algorithm which achieves state-of-the-art
performance on a broad variety of tasks; and
secondly, it belongs to the same class of classifiers
as our converted transformation-based rule list
(henceforth TBLDT).

Word         POS tag   Chunk Tag
A.P.         NNP       B-NP
Green        NNP       I-NP
currently    RB        B-ADVP
has          VBZ       B-VP
2,664,098    CD        B-NP
shares       NNS       I-NP
outstanding  JJ        B-ADJP

Table 2: Example of a sentence with chunk tags

To perform a fair evaluation, extra care was
taken to ensure that both C4.5 and TBLDT
explore as similar a sample space as possible. The
systems were allowed to consult the word, the
part-of-speech, and the chunk tag of all examples
within a window of 5 positions (2 words on either
side) of each target example.^2 Since multiple
features covering the entire vocabulary of the
training set would be too large a space for C4.5
to deal with, in all experiments where TBLDT
is directly compared with C4.5, the word types
that both systems can include in their predicates
are restricted to the 100 most "ambiguous" words
in the training set, as measured by the number of
chunk tag types that are assigned to them. The
initial prediction was made for both systems using
a class assignment based solely on the part-of-speech
tag of the word.
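
The ambiguity criterion used to restrict the lexicon can be sketched as follows. The function name, the `(word, chunk_tag)` input format, and the alphabetical tie-breaking are assumptions of this sketch; the paper does not say how ties were resolved.

```python
from collections import defaultdict

def most_ambiguous_words(tagged_tokens, n=100):
    """Return the n word types that occur with the most distinct chunk tags.

    tagged_tokens: iterable of (word, chunk_tag) pairs from the training set.
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    tags_per_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_per_word[word].add(tag)
    ranked = sorted(tags_per_word, key=lambda w: (-len(tags_per_word[w]), w))
    return ranked[:n]
```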

Considering chunk tags within a contextual win- 
dow of the target word raises a problem with C4.5. 
A decision tree generally trains on independent 
samples and does not take into account changes 
of any features in the context. In our case, the 
samples are dependent; the classification of sample 
i is a feature for sample i + 1, which means that
changing the classification for sample i affects 
the context of sample i + 1. To address this 
problem, the C4.5 systems are trained with the 
correct chunks in the left context. When the 
system is used for classification, input is processed 
in a left-to-right manner; and the output of the 
system is fed forward to be used as features 
in the left context of following samples. Since 
C4.5 generates probabilities for each classification 
decision, they can be redirected into the input for 
the next position. Providing the decision tree with 
this confidence information effectively allows it to 
perform a limited search over the entire sentence. 

C4.5 does have one advantage over TBLDT,
however. A decision tree can be trained using the
subsetting feature, where questions asked are of
the form: "does feature f belong to the set F?".
This is not something that a TBL system can do readily,
but since the objective is to compare TBLDT to
another state-of-the-art system, this feature was
enabled.

^2 The TBL templates are similar to those used in
Ramshaw and Marcus (1999).

4.1 Evaluation Measures 

The most commonly used measure for evaluating 
tagging tasks is tag accuracy. It is defined as 



$$\text{Accuracy} = \frac{\#\text{ of correctly tagged examples}}{\#\text{ of examples}}$$



In syntactic parsing, though, since the task is 
to identify the phrasal components, it is more 
appropriate to measure the precision and recall: 



$$\text{Precision} = \frac{\#\text{ of correct proposed phrases}}{\#\text{ of proposed phrases}}$$

$$\text{Recall} = \frac{\#\text{ of correct proposed phrases}}{\#\text{ of correct phrases}}$$



To facilitate the comparison of systems with dif- 
ferent precision and recall, the F-measure metric 
is computed as a weighted harmonic mean of 
precision and recall: 



$$F_\beta = \frac{(\beta^2 + 1) \times \text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$$

The $\beta$ parameter is used to give more weight to
precision or recall, as the task at hand requires.
In all our experiments, $\beta$ is set to 1, giving equal
weight to precision and recall.
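
As a worked illustration of these measures, the sketch below computes precision, recall and F-measure from two sets of phrases. Representing a phrase as a hypothetical `(start, end, type)` triple is an assumption of the sketch; the experiments themselves rely on the CoNLL evaluation tool.

```python
def precision_recall_f(proposed, correct, beta=1.0):
    """Phrase-level precision, recall and F_beta.

    proposed, correct: collections of hashable phrase descriptions,
    e.g. (start, end, chunk_type) triples.
    """
    proposed, correct = set(proposed), set(correct)
    hits = len(proposed & correct)          # correct proposed phrases
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(correct) if correct else 0.0
    denom = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f
```

With three proposed phrases of which two are right, against four reference phrases, precision is 2/3, recall is 1/2, and $F_1$ is 4/7.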

The reported performances are all measured
with the evaluation tool provided with the CoNLL
corpus (CoNLL, 2000).



4.2 Active Learning 

To demonstrate the usefulness of obtaining
probabilities from a transformation rule list, this section
describes an application which utilizes these
probabilities, and compares the resulting performance
of the system with that achieved by C4.5.

Natural language processing has traditionally
required large amounts of annotated data from
which to extract linguistic properties. However,
not all data is created equal: a normal distribution
of annotated data contains much redundant
information. Seung et al. (1992) and Freund et
al. (1997) proposed a theoretical active learning
approach, where samples are intelligently selected
for annotation. By eliminating redundant information,
the same performance can be achieved
while using fewer resources. Empirically, active
learning has been applied to various NLP tasks
such as text categorization (Lewis and Gale, 1994;
Lewis and Catlett, 1994; Liere and Tadepalli,
1997), part-of-speech tagging (Dagan and Engelson,
1995; Engelson and Dagan, 1996), and base
noun phrase chunking (Ngai and Yarowsky, 2000),
resulting in significant reductions in the
quantity of data needed to achieve comparable
performance.



This section presents two experimental results 
which show the effectiveness of the probabilities 
generated by the TBLDT. The first experiment 
compares the performance achieved by the active 
learning algorithm using TBLDT with the perfor- 
mance obtained by selecting samples sequentially 
from the training set. The second experiment 
compares the performances achieved by TBLDT 
and C4.5 training on samples selected by active 
learning. 

The following describes the active learning algo- 
rithm used in the experiments: 

1. Label an initial $T_1$ sentences of the corpus;

2. Use the machine learning algorithm (C4.5 or
TBLDT) to obtain chunk probabilities on the
rest of the training data;

3. Choose $T_2$ samples from the rest of the training
set, specifically the samples that optimize
an evaluation function $f$, based on the class
distribution probability of each sample;

4. Add the samples, including their "true"
classification,^3 to the training pool and retrain the
system;

5. If a desired number of samples is reached,
stop, otherwise repeat from Step 2.

The evaluation function $f$ that was used in our
experiments is:

$$f(S) = \frac{1}{|S|} \sum_{i=1}^{|S|} H(C|S,i)$$

where $H(C|S,i)$ is the entropy of the chunk
probability distribution associated with the word
index $i$ in sentence $S$.
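
The selection step can be sketched as below. The sketch assumes that "optimizing" $f$ means choosing the sentences with the highest average per-word entropy, and that each sentence in the unlabeled pool is represented by a list of per-word probability vectors; the names `sentence_score` and `select_for_annotation` are invented for the illustration.

```python
import math

def entropy(dist):
    """Entropy (in bits) of one probability vector; zero entries contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def sentence_score(word_dists):
    """f(S): the average entropy of the per-word chunk tag distributions."""
    if not word_dists:
        return 0.0
    return sum(entropy(d) for d in word_dists) / len(word_dists)

def select_for_annotation(pool, t2):
    """Pick the t2 sentences the classifier is least sure about.

    pool: {sentence_id: [probability vector for each word]}.
    """
    ranked = sorted(pool, key=lambda sid: sentence_score(pool[sid]), reverse=True)
    return ranked[:t2]
```

Averaging over the sentence length keeps long sentences from being selected merely because they contain more words.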

Figure 2 displays the performance (F-measure
and chunk accuracy) of a TBLDT system trained
on samples selected by active learning and the
same system trained on samples selected sequentially
from the corpus versus the number of words
in the annotated training set. At each step of
the iteration, the active learning-trained TBLDT
system achieves a higher accuracy/F-measure, or,
conversely, is able to obtain the same performance
level with less training data. Overall, our system
can yield the same performance as the sequential
system with 45% less data, a significant reduction
in the annotation effort.

(a) F-measure vs. number of words in training set
(b) Chunk Accuracy vs. number of words in training set

Figure 2: Performance of the TBLDT system versus sequential choice.

Figure 3 shows a comparison between two active
learning experiments: one using TBLDT and the
other using C4.5.^4 For completeness, a sequential
run using C4.5 is also presented. Even though
C4.5 examines a larger space than TBLDT by
utilizing the feature subset predicates, TBLDT
still performs better. The difference in accuracy at
6200 words (at the end of the active learning run
for TBLDT) is statistically significant at a 0.0003
level.

As a final remark on this experiment, note that
at an annotation level of 19000 words, the fully
lexicalized TBLDT outperformed the C4.5 system
by making 15% fewer errors.

(a) F-measure vs. number of words in training set
(b) Accuracy vs. number of words in training set
Curves: AL+TBLDT (100 words), AL+C4.5, Sequential+C4.5

Figure 3: Performance of the TBLDT system versus the DT system

^3 The true (reference or gold standard) classification is
available in this experiment. In an annotation situation,
the samples are sent to human annotators for labeling.

^4 As mentioned earlier, both the TBLDT and C4.5 were
limited to the same 100 most ambiguous words in the
corpus to ensure comparability.

4.3 Rejection curves

It is often very useful for a classifier to be able
to offer confidence scores associated with its decisions.
Confidence scores are associated with the
probability $P(C(x) \text{ correct} \mid x)$ where $C(x)$ is the
classification of sample $x$. These scores can be
used in real-life problems to reject samples that
the classifier is not sure about, in which case
a better observation, or a human decision, might
be requested. The performance of the classifier
is then evaluated on the samples that were not
rejected. This experiment framework is
well-established in machine learning and optimization
research (Dietterich and Bakiri, 1995; Priebe et
al., 1999).

Since non-probabilistic classifiers do not offer 
any insights into how sure they are about a 
particular classification, it is not easy to obtain 
confidence scores from them. A probabilistic 
classifier, in contrast, offers information about the 
class probability distribution of a given sample. 
Two measures that can be used in generating 
confidence scores are proposed in this section. 

The first measure, the entropy $H$ of the class
probability distribution of a sample $x$, $C(x) =
\{p(c_1|x), p(c_2|x) \ldots p(c_k|x)\}$, is a measure of the
uncertainty in the distribution:

$$H(C(x)) = -\sum_{i=1}^{k} p(c_i|x) \log_2 p(c_i|x)$$

The higher the entropy of the distribution of
class probability estimates, the more uncertain the
classifier is of its classification. The samples
selected for rejection are chosen by sorting the data
using the entropies of the estimated probabilities,
and then selecting the ones with highest entropies.
The resulting curve is a measure of the correlation
between the true probability distribution and the
one given by the classifier.

(a) Subcorpus (batch) rejection: accuracy vs. percent of rejected data
(b) Threshold (online) rejection: accuracy vs. probability of the most likely tag
Curves: TBLDT, C4.5 (soft decisions), C4.5 (hard decisions)

Figure 4: Rejection curves.



Figure 4(a) shows the rejection curves for the
TBLDT system and two C4.5 decision trees: one
which receives a probability distribution as input
("soft" decisions on the left context), and one
which receives classifications ("hard" decisions on
all fields). At the left of the curve, no samples
are rejected; at the right side, only the samples
about which the classifiers were most certain are
kept (the samples with minimum entropy). Note
that the y-values on the right side of the curve are
based on less data, effectively introducing wider
variance in the curve as it moves right.

As shown in Figure 4(a), the C4.5 classifier
that has access to the left context chunk tag 
probability distributions behaves better than the 
other C4.5 system, because this information about 
the surrounding context allows it to effectively 
perform a shallow search of the classification 
space. The TBLDT system, which also receives 
a probability distribution on the chunk tags in 
the left context, clearly outperforms both C4.5 
systems at all rejection levels. 
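
The entropy-based (batch) rejection curve can be sketched as follows. The function name, the way the rejected fraction is rounded to a whole number of samples, and the list-based inputs are assumptions made for this illustration.

```python
import math

def entropy(dist):
    """Entropy (in bits) of one class probability vector."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def batch_rejection_curve(dists, is_correct, fractions):
    """Accuracy on the kept samples after rejecting the highest-entropy ones.

    dists: class probability vector per sample.
    is_correct: whether the classifier's decision was right, per sample.
    fractions: fractions of the data to reject, e.g. [0.0, 0.25, 0.5].
    Returns a list of (fraction_rejected, accuracy_on_kept).
    """
    # Most certain samples (lowest entropy) first.
    order = sorted(range(len(dists)), key=lambda i: entropy(dists[i]))
    curve = []
    for frac in fractions:
        n_reject = int(round(frac * len(order)))
        kept = order[: len(order) - n_reject]
        if kept:
            accuracy = sum(is_correct[i] for i in kept) / len(kept)
            curve.append((frac, accuracy))
    return curve
```

Because the whole data set must be sorted before anything is rejected, this scheme only works in batch mode.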

The second proposed measure is based on the
probability of the most likely tag. The assumption
here is that this probability is representative of
how certain the system is about the classification.
The samples are put in bins based on
the probability of the most likely chunk tag, and
accuracies are computed for each bin (these bins
are cumulative, meaning that a sample will be
included in all the bins that have a lower threshold
than the probability of its most likely chunk
tag). At each accuracy level, a sample will be
rejected if the probability of its most likely chunk
tag is below the accuracy level. The resulting curve
is a measure of the correlation between the true
distribution probability and the probability of the
most likely chunk tag, i.e. how appropriate those
probabilities are as confidence measures. Unlike
the first measure mentioned before, a threshold
obtained using this measure can be used in an
online manner to identify the samples of whose
classification the system is confident.

Figure 4(b) displays the rejection curve for
the second measure and the same three systems.
TBLDT again outperforms both C4.5 systems, at
all levels of confidence.

Model      Perplexity   Cross Entropy
TBLDT      1.2944       0.2580
DT+probs   1.4150       0.3471
DT         1.4568       0.3763

Table 3: Cross entropy and perplexities for two
C4.5 systems and the TBLDT system

In summary, the TBLDT system outperforms 
both C4.5 systems presented, resulting in fewer re- 
jections for the same performance, or, conversely, 
better performance at the same rejection rate. 
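
The threshold-based (online) rejection scheme of this section can be sketched as follows. The function name and the choice to keep a sample whose top probability equals the threshold are assumptions of the sketch.

```python
def threshold_rejection_curve(max_probs, is_correct, thresholds):
    """Accuracy on samples whose most likely tag reaches each threshold.

    max_probs: probability of the most likely tag, per sample.
    is_correct: whether the classifier's decision was right, per sample.
    Returns a list of (threshold, accuracy_on_kept, fraction_rejected).
    The bins are cumulative: a sample is kept for every threshold
    that does not exceed its top probability.
    """
    curve = []
    for th in thresholds:
        kept = [c for p, c in zip(max_probs, is_correct) if p >= th]
        if kept:
            accuracy = sum(kept) / len(kept)
            rejected = 1 - len(kept) / len(max_probs)
            curve.append((th, accuracy, rejected))
    return curve
```

Since each decision depends only on the sample's own top probability, a threshold chosen this way can be applied to samples one at a time.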

4.4 Perplexity and Cross Entropy 

Cross entropy is a goodness measure for probabil- 
ity estimates that takes into account the accuracy 
of the estimates as well as the classification accu- 
racy of the system. It measures the performance 
of a system trained on a set of samples distributed 
according to the probability distribution p when 
tested on a set following a probability distribution 
q. More specifically, we utilize conditional cross 
entropy, which is defined as 



$$H(C|X) = -\sum_{x \in X} q(x) \sum_{c \in C} q(c|x) \cdot \log_2 p(c|x)$$

where $X$ is the set of examples and $C$ is the set of
chunk tags, $q$ is the probability distribution on the
test document and $p$ is the probability distribution
on the train corpus.

Table 4: Performance of TBLDT on the CoNLL
Test Set

Table 5: Performance of C4.5 on the CoNLL Test
Set

The cross entropy metric fails if any outcome is 
given zero probability by the estimator. To avoid 
this problem, estimators are "smoothed", ensuring 
that novel events receive non-zero probabilities. 
A very simple smoothing technique (interpolation 
with a constant) was used for all of these systems. 

A closely related measure is perplexity, defined
as

$$P = 2^{H(C|X)}$$

The cross entropy and perplexity results for the
various estimation schemes are presented in Table
3. The TBLDT outperforms both C4.5 systems,
obtaining better cross-entropy and chunk tag perplexity.
This shows that the overall probability
distribution obtained from the TBLDT system
better matches the true probability distribution.
This strongly suggests that probabilities generated
this way can be used successfully in system
combination techniques such as voting or boosting.
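
A sketch of how these two measures can be computed on a test set is given below. It assumes the empirical form of the conditional cross entropy (the average negative log probability assigned to the true tag), smoothing by interpolation with the uniform distribution, and an interpolation weight `lam` that is purely illustrative. The logarithm base is left as a parameter because the values reported in Table 3 satisfy perplexity = $e^{H}$ (for instance $e^{0.2580} \approx 1.2944$), which corresponds to natural logarithms.

```python
import math

def smooth(dist, classes, lam=0.01):
    """Interpolate with the uniform distribution so no class has zero probability."""
    return {c: (1 - lam) * dist.get(c, 0.0) + lam / len(classes) for c in classes}

def cross_entropy(dists, gold, classes, lam=0.01, base=2.0):
    """Average negative log probability of the true tag under the smoothed model."""
    total = 0.0
    for dist, truth in zip(dists, gold):
        total -= math.log(smooth(dist, classes, lam)[truth], base)
    return total / len(gold)

def perplexity(h, base=2.0):
    """Perplexity for a cross entropy measured with logarithms of the given base."""
    return base ** h
```

With `lam` greater than zero, a tag that the model gave zero probability still yields a finite (though large) contribution instead of an infinite one.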

4.5 Chunking performance 

It is worth noting that the transformation-based
system used in the comparative graphs in Figure
3 was not running at full potential. As described
earlier, the TBLDT system was only allowed to
consider words that C4.5 had access to. However,
a comparison between the corresponding TBLDT
curves in Figure 2 (where the system is given
access to all the words) and Figure 3 shows that a
transformation-based system given access to all
the words performs better than the one with a
restricted lexicon, which in turn outperforms the
best C4.5 decision tree system both in terms of
accuracy and F-measure.

Table 4 shows the performance of the TBLDT
system on the full CoNLL test set, broken down
by chunk type. Even though the TBLDT results
could not be compared with other published results
on the same task and data (CoNLL will
not take place until September 2000), our system
significantly outperforms a similar system trained
with a C4.5 decision tree, shown in Table 5, both
in chunk accuracy and F-measure.



5 Conclusions 

In this paper we presented a novel way to convert
transformation rule lists, a common paradigm in
natural language processing, into a form that is
equivalent in its classification behavior, but is
capable of providing probability estimates. Using
this approach, the favorable properties of transformation
rule lists that make them popular for
language processing are retained, while the many
advantages of a probabilistic system are gained.

To demonstrate the efficacy of this approach,
the resulting probabilities were tested in three
ways: directly measuring the modeling accuracy
on the test set via cross entropy, testing the
goodness of the output probabilities in an active
learning algorithm, and observing the rejection
curves attained from these probability estimates.
The experiments clearly demonstrate that the
resulting probabilities perform at least as well as
the ones generated by C4.5 decision trees, resulting
in better performance in all cases. This shows that
the resulting probabilistic classifier is at least as
good as other state-of-the-art probabilistic models.

The positive results obtained suggest that the
probabilistic classifier obtained from transformation
rule lists can be successfully used in machine
learning algorithms that require soft-decision classifiers,
such as boosting or voting. Future research
will include testing the behavior of the system
under AdaBoost (Freund and Schapire, 1997). We
also intend to investigate the effects that other
decision tree growth and smoothing techniques
may have on continued refinement of the converted
rule list.